Association Rule Mining (ARM) and Networking Analysis¶

Introduction¶

In this part, I will apply Association Rule Mining (ARM) and network analysis to the text data (Twitter response tweets). The dataset (tweetEN_clean.csv) used for ARM and networking can be found here. Moreover, the code implementing these steps will be shown along the way, and it can also be found here.

What is Networking?¶

A network is basically a set of objects connected to each other, similar to a spider web. The connections, or associations, between the objects are displayed as links or edges. Edges can be either directed or undirected, and either weighted or unweighted. By examining the directions and weights of the edges, we can dive deeper into the connections and relationships between the objects.

Typically, for N objects there exist on the order of N^2 possible connections: N(N-1) directed edges (excluding self-loops), or N(N-1)/2 undirected ones.
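This count can be verified with a quick sketch using Python's itertools (an illustrative aside, not part of the original analysis):

```python
from itertools import combinations, permutations

# For N objects: N*(N-1) possible directed edges (no self-loops),
# and N*(N-1)/2 possible undirected edges.
N = 5
objects = range(N)

directed = list(permutations(objects, 2))    # ordered pairs (u, v)
undirected = list(combinations(objects, 2))  # unordered pairs {u, v}

print(len(directed))    # 20 == 5 * 4
print(len(undirected))  # 10 == 5 * 4 / 2
```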

In this tab, a Python package named NetworkX will be used to create, manipulate, and analyze the structure, dynamics, and functions of the networks of the Twitter response text.
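As a minimal sketch of the NetworkX API used later, the snippet below builds a small directed, weighted graph (the node names are made up for illustration, not taken from the tweet data):

```python
import networkx as nx

# Build a small directed, weighted graph.
G = nx.DiGraph()
G.add_edge("korea", "fertility", weight=0.8)  # directed edge with a weight
G.add_edge("fertility", "rate", weight=0.5)
G.add_edge("rate", "korea", weight=0.3)

print(G.number_of_nodes())                # 3
print(G.number_of_edges())                # 3
print(G["korea"]["fertility"]["weight"])  # 0.8
print(G.has_edge("fertility", "korea"))   # False -- edges are directed
```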

What is Association Rule Mining (ARM)?¶

Association Rule Mining (ARM) is an unsupervised and rule-based machine learning method for finding inter-relations between variables in large databases based on their statistical relevance. It is worth noting that association measures the co-occurrence of the objects instead of causality.

Given a training dataset, the goal of ARM is to discover rules that will forecast the existence of an object based on the appearance of other objects in the training data. In other words, ARM is designed to uncover the connections in the dataset and how strong those connections are.

Some real-life applications of ARM are:

  • Market Basket Analysis: predicting customers' purchase decisions based on the other items they choose; useful for making decisions about marketing and promotional events
  • Natural Language Processing (NLP): discovering associations in text data such as reviews (patient/customer feedback, movie/book reviews, etc.), documents (novels, speeches, etc.), articles (academic articles, news, etc.), and social posts (Twitter, Facebook, Instagram posts, etc.)
  • Image Analysis
  • Click Streams
  • Bio Data

Read Twitter Text Data¶

The dataset is a clean comma-separated values (CSV) file with 775 rows and 11 variables, including the following fields of the Twitter response (tweet) data: author_id (tweet author ID), id (tweet ID), created_at (tweet creation date and time), text (original tweet content), clean_text (clean tweet content after punctuation, special characters, etc. are removed), tweet_tokenized (tokenized clean tweet content), tweet_nonstop (tokenized clean tweet with stop words removed), tweet_stemmed (stemmed clean tweet content), tweet_lemmatized (lemmatized clean tweet content), sentiment (sentiment value of the tweet), and label (label generated from the tweet sentiment value).

However, the only variable used for Association Rule Mining (ARM) and network analysis is tweet_lemmatized, the lemmatized tweet content produced by the text-cleaning process. In other words, the lemmatized tweet tokens will serve as the "transaction data" for exploring the associations in the text.

In [ ]:
### Import Relevant Packages
import json
import nltk
import string
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer

import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from apyori import apriori
import networkx as nx 

Below is a snapshot of what the clean tweet dataset looks like:

In [ ]:
tweetDF = pd.read_csv("tweetEN_clean.csv")
tweetDF.head()
Out[ ]:
author_id id created_at text clean_text tweet_tokenized tweet_nonstop tweet_stemmed tweet_lemmatized sentiment label
0 1116548763168858112 1575191634026717184 2022-09-28T18:32:05.000Z RT @VSkirbekk: Gradual convergence in fertilit... gradual convergence fertility between china we... ['gradual', 'convergence', 'fertility', 'betwe... ['gradual', 'convergence', 'fertility', 'china... ['gradual', 'converg', 'fertil', 'china', 'wes... ['gradual', 'convergence', 'fertility', 'china... 0.0000 neutral
1 1364997075851599873 1575190110920114178 2022-09-28T18:26:02.000Z RT @nytimes: South Korea has had the world's l... south korea world lowest total fertility rate ... ['south', 'korea', 'world', 'lowest', 'total',... ['south', 'korea', 'world', 'lowest', 'total',... ['south', 'korea', 'world', 'lowest', 'total',... ['south', 'korea', 'world', 'lowest', 'total',... 0.2415 positive
2 1231317688288468994 1575189759550693377 2022-09-28T18:24:38.000Z RT @nytimes: South Korea has had the world's l... south korea world lowest total fertility rate ... ['south', 'korea', 'world', 'lowest', 'total',... ['south', 'korea', 'world', 'lowest', 'total',... ['south', 'korea', 'world', 'lowest', 'total',... ['south', 'korea', 'world', 'lowest', 'total',... 0.2415 positive
3 780710885186674688 1575188100078133248 2022-09-28T18:18:03.000Z RT @nytimes: South Korea has had the world's l... south korea world lowest total fertility rate ... ['south', 'korea', 'world', 'lowest', 'total',... ['south', 'korea', 'world', 'lowest', 'total',... ['south', 'korea', 'world', 'lowest', 'total',... ['south', 'korea', 'world', 'lowest', 'total',... 0.2415 positive
4 178464094 1575186284808531969 2022-09-28T18:10:50.000Z RT @koryodynasty: Always trust men to come up ... always trust come with policies women this cas... ['always', 'trust', 'come', 'with', 'policies'... ['always', 'trust', 'come', 'policies', 'women... ['alway', 'trust', 'come', 'polici', 'women', ... ['always', 'trust', 'come', 'policy', 'woman',... 0.7184 positive

Below is some basic information about the clean tweet dataset:

In [ ]:
tweetDF.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 775 entries, 0 to 774
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   author_id         775 non-null    int64  
 1   id                775 non-null    int64  
 2   created_at        775 non-null    object 
 3   text              775 non-null    object 
 4   clean_text        769 non-null    object 
 5   tweet_tokenized   775 non-null    object 
 6   tweet_nonstop     775 non-null    object 
 7   tweet_stemmed     775 non-null    object 
 8   tweet_lemmatized  775 non-null    object 
 9   sentiment         775 non-null    float64
 10  label             775 non-null    object 
dtypes: float64(1), int64(2), object(8)
memory usage: 66.7+ KB

Reformatting Data for Network Analysis¶

The chunk of code below extracts each tweet_lemmatized element, parses each row's string back into a list of tokens, and collects those lists into a list of lists, so the result looks like this:

[[lemmatized_tweet_1], [lemmatized_tweet_2], [lemmatized_tweet_3], ...]

In [ ]:
import ast

tweets = list(tweetDF["tweet_lemmatized"])

out = []

for twt in tweets:
    out.append(ast.literal_eval(twt))
    
tweets = out

The chunk of code below reformats the apriori output into a pandas DataFrame with columns "rhs", "lhs", "supp", "conf", "supp x conf", "lift":

  • supp (support): the support of A and B, supp(A, B), measures how often the items in A and the items in B occur together relative to all transactions, which scales how common an item-set is (1 = very common, 0 = irrelevant)

  • conf (confidence): the confidence of A and B, conf(A, B), measures how often the items in A and the items in B occur together relative to the transactions that contain A, which scales how statistically "strong" a rule is (1 = strong rule, B appears every time A does; 0 = no instance of the rule occurring)

  • supp x conf (support x confidence): a large support means a frequently occurring rule, and a large confidence means a strong rule; hence, a large product suggests the rule is both frequent and strong

  • lift: the lift of a rule is the ratio of the observed support to the support expected if A and B were independent; lift > 1 indicates A and B co-occur more often than chance
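As a worked illustration of these metrics (a minimal sketch on made-up toy transactions, not the tweet data):

```python
# Toy transactions (made-up baskets for illustration only)
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread"},
    {"milk", "eggs"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / n

# Rule: {milk} -> {bread}
supp = support({"milk", "bread"})   # 2/4 = 0.5
conf = supp / support({"milk"})     # 0.5 / 0.75 = 2/3
lift = conf / support({"bread"})    # (2/3) / 0.75 = 8/9 < 1: weak rule

print(round(supp, 3), round(conf, 3), round(lift, 3))
```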

In [ ]:
# Reformat the apriori output into a pandas DataFrame with columns
# "rhs", "lhs", "supp", "conf", "supp x conf", "lift"
def reformat_results(results):
    # Each apyori record is (items, support, ordered_statistics); each
    # ordered statistic is (items_base, items_add, confidence, lift).
    keep = []
    for record in results:
        supp = record[1]  # support of the whole item-set
        for stat in record[2]:
            if len(stat[0]) != 0:  # skip rules with an empty antecedent
                rhs = list(stat[0])
                lhs = list(stat[1])
                conf = float(stat[2])
                lift = float(stat[3])
                keep.append([rhs, lhs, supp, conf, supp * conf, lift])

    return pd.DataFrame(keep, columns = ["rhs","lhs","supp","conf","supp x conf","lift"])
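To make the positional indexing above concrete: apyori returns records shaped roughly like the namedtuples below (a mock of the structure for illustration, not apyori itself), which is why `results[i][1]` is the support and `results[i][2]` holds the per-rule statistics:

```python
from collections import namedtuple

# Mock of apyori's output structure (illustration only, not the real library).
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple("OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])

record = RelationRecord(
    items=frozenset({"south", "korea"}),
    support=0.1,
    ordered_statistics=[
        OrderedStatistic(frozenset({"south"}), frozenset({"korea"}), 0.9, 5.0),
    ],
)

# The positional indexing used in reformat_results:
print(record[1])      # 0.1 -- support of the item-set
stat = record[2][0]   # first ordered statistic
print(list(stat[0]))  # ['south'] -- antecedent (stored in the "rhs" column)
print(float(stat[2])) # 0.9 -- confidence
```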

The chunk of code below converts the dataframe of the apriori output to a NetworkX object:

In [ ]:
# Utility function: Convert to NetworkX object

def convert_to_network(df):
    print(df)

    # BUILD GRAPH
    G = nx.DiGraph()  # directed graph
    for _, row in df.iterrows():
        lhs = "_".join(row.iloc[0])  # antecedent node label
        rhs = "_".join(row.iloc[1])  # consequent node label
        conf = row.iloc[3]           # confidence, used as the edge weight

        if lhs not in G.nodes:
            G.add_node(lhs)
        if rhs not in G.nodes:
            G.add_node(rhs)

        if (lhs, rhs) not in G.edges:
            G.add_edge(lhs, rhs, weight=conf)

    return G

The code chunk below defines a function that plots the NetworkX object as a network graph (nodes and edges):

In [ ]:
# Utility function: Plot NetworkX object

def plot_network(G):
    # Specify x-y positions for plotting
    pos = nx.random_layout(G)

    # Generate plot
    fig, ax = plt.subplots()
    fig.set_size_inches(15, 15)

    # Edge widths based on the edge weights (rule confidence)
    weights_e = [G[u][v]['weight'] for u, v in G.edges()]

    # Sample a colormap for edge colors
    cmap = plt.cm.Blues
    colors_e = [cmap(G[u][v]['weight'] * 10) for u, v in G.edges()]

    # Plot
    nx.draw(
        G,
        edgecolors="black",
        edge_color=colors_e,
        node_size=2000,
        linewidths=2,
        font_size=8,
        font_color="white",
        font_weight="bold",
        width=weights_e,
        with_labels=True,
        pos=pos,
        ax=ax,
    )
    ax.set(title='Twitter Response (Fertility Hashtag)')
    plt.show()

In [ ]:
print("Transactions:\n", pd.DataFrame(out[:5]))
Transactions:
         0            1          2       3        4          5          6   \
0  gradual  convergence  fertility   china  western    country       None   
1    south        korea      world  lowest    total  fertility       rate   
2    south        korea      world  lowest    total  fertility       rate   
3    south        korea      world  lowest    total  fertility       rate   
4   always        trust       come  policy    woman       case  ingenious   

     7      8      9     10     11     12         13  
0  None   None   None  None   None   None       None  
1  year  mayor  seoul  said  nanny  would  encourage  
2  year  mayor  seoul  said  nanny  would  encourage  
3  year  mayor  seoul  said  nanny  would  encourage  
4  plan   baby  boost  None   None   None       None  

Train the ARM model using the apriori function from the apyori package, and fit the model on the text data:

In [ ]:
# INSERT CODE TO TRAIN THE ARM MODEL USING THE "apriori" PACKAGE
#results = list(apriori(out, min_support=0.003, min_confidence=0.02, min_length=1, max_length=5))
#results = list(apriori(transactions, min_support=0.005, min_confidence=0.05, min_length=1, max_length=5))
results = list(apriori(out, min_support = 0.05, min_confidence = 0.3, min_lift = 4, min_length = 4, max_length = 5))
print(len(results))
4255

Plotting NetworkX Objects¶

The chunks of code below perform the following tasks:

  • reformat the text (transaction) data
  • convert the reformatted text data to a NetworkX object
  • plot the NetworkX object as a network graph that contains links and nodes
In [ ]:
# INSERT CODE TO PLOT THE RESULTS AS A NETWORK-X OBJECT 
pd_results = reformat_results(results[:50])
G = convert_to_network(pd_results)
plot_network(G)
            rhs          lhs      supp      conf  supp x conf       lift
0      [always]       [baby]  0.068387  0.981481     0.067121  12.469642
1        [baby]     [always]  0.068387  0.868852     0.059418  12.469642
2      [always]      [boost]  0.068387  0.981481     0.067121  14.086077
3       [boost]     [always]  0.068387  0.981481     0.067121  14.086077
4      [always]       [case]  0.068387  0.981481     0.067121  13.114623
..          ...          ...       ...       ...          ...        ...
95     [record]     [factor]  0.161290  0.892857     0.144009   5.405971
96     [factor]  [shattered]  0.161290  0.976562     0.157510   5.866945
97  [shattered]     [factor]  0.161290  0.968992     0.156289   5.866945
98  [ingenious]       [plan]  0.068387  1.000000     0.068387  14.351852
99       [plan]  [ingenious]  0.068387  0.981481     0.067121  14.351852

[100 rows x 6 columns]
In [ ]:
pd_results = reformat_results(results[51:100])
G = convert_to_network(pd_results)
plot_network(G)
                  rhs               lhs      supp      conf  supp x conf  \
0         [ingenious]           [trust]  0.068387  1.000000     0.068387   
1             [trust]       [ingenious]  0.068387  1.000000     0.068387   
2         [ingenious]           [woman]  0.068387  1.000000     0.068387   
3             [woman]       [ingenious]  0.068387  0.456897     0.031246   
4             [mayor]           [nanny]  0.101935  0.766990     0.078184   
..                ...               ...       ...       ...          ...   
169            [case]  [always, policy]  0.068387  0.913793     0.062492   
170          [policy]    [always, case]  0.068387  0.828125     0.056633   
171    [always, case]          [policy]  0.068387  1.000000     0.068387   
172  [always, policy]            [case]  0.068387  1.000000     0.068387   
173    [policy, case]          [always]  0.068387  1.000000     0.068387   

          lift  
0    14.622642  
1    14.622642  
2     6.681034  
3     6.681034  
4     7.248994  
..         ...  
169  13.362069  
170  12.109375  
171  12.109375  
172  13.362069  
173  14.351852  

[174 rows x 6 columns]
In [ ]:
pd_results = reformat_results(results[101:150])
G = convert_to_network(pd_results)
plot_network(G)
                 rhs              lhs      supp      conf  supp x conf  \
0           [always]    [case, woman]  0.068387  0.981481     0.067121   
1             [case]  [always, woman]  0.068387  0.913793     0.062492   
2            [woman]   [always, case]  0.068387  0.456897     0.031246   
3     [always, case]          [woman]  0.068387  1.000000     0.068387   
4    [always, woman]           [case]  0.068387  1.000000     0.068387   
..               ...              ...       ...       ...          ...   
289           [case]   [trust, boost]  0.068387  0.913793     0.062492   
290          [trust]    [boost, case]  0.068387  1.000000     0.068387   
291    [boost, case]          [trust]  0.068387  1.000000     0.068387   
292   [trust, boost]           [case]  0.068387  1.000000     0.068387   
293    [trust, case]          [boost]  0.068387  1.000000     0.068387   

          lift  
0    14.351852  
1    13.362069  
2     6.681034  
3     6.681034  
4    13.362069  
..         ...  
289  13.362069  
290  14.622642  
291  14.622642  
292  13.362069  
293  14.351852  

[294 rows x 6 columns]
In [ ]:
pd_results = reformat_results(results[151:200])
G = convert_to_network(pd_results)
plot_network(G)
                    rhs                 lhs      supp      conf  supp x conf  \
0               [boost]   [come, ingenious]  0.068387  0.981481     0.067121   
1                [come]  [boost, ingenious]  0.068387  0.679487     0.046468   
2           [ingenious]       [come, boost]  0.068387  1.000000     0.068387   
3         [come, boost]         [ingenious]  0.068387  1.000000     0.068387   
4    [boost, ingenious]              [come]  0.068387  1.000000     0.068387   
..                  ...                 ...       ...       ...          ...   
241              [come]       [trust, case]  0.068387  0.679487     0.046468   
242             [trust]        [come, case]  0.068387  1.000000     0.068387   
243        [come, case]             [trust]  0.068387  0.981481     0.067121   
244       [trust, case]              [come]  0.068387  1.000000     0.068387   
245       [trust, come]              [case]  0.068387  1.000000     0.068387   

          lift  
0    14.351852  
1     9.935897  
2    14.622642  
3    14.622642  
4     9.935897  
..         ...  
241   9.935897  
242  14.351852  
243  14.351852  
244   9.935897  
245  13.362069  

[246 rows x 6 columns]
In [ ]:
pd_results = reformat_results(results[1501:1550])
G = convert_to_network(pd_results)
plot_network(G)
                       rhs                     lhs      supp      conf  \
0                   [said]  [lowest, seoul, total]  0.096774  0.833333   
1                  [seoul]   [lowest, said, total]  0.096774  0.757576   
2                  [total]   [lowest, said, seoul]  0.096774  0.903614   
3           [lowest, said]          [seoul, total]  0.096774  1.000000   
4          [lowest, seoul]           [said, total]  0.096774  0.986842   
..                     ...                     ...       ...       ...   
509         [mayor, world]          [nanny, south]  0.099355  0.962500   
510         [nanny, south]          [mayor, world]  0.099355  0.974684   
511         [world, nanny]          [mayor, south]  0.099355  1.000000   
512  [mayor, world, south]                 [nanny]  0.099355  0.962500   
513  [world, nanny, south]                 [mayor]  0.099355  1.000000   

     supp x conf       lift  
0       0.080645   8.497807  
1       0.073314   7.828283  
2       0.087447   9.337349  
3       0.096774  10.197368  
4       0.095501  10.197368  
..           ...        ...  
509     0.095629   9.442247  
510     0.096840   9.442247  
511     0.099355   7.524272  
512     0.095629   9.096799  
513     0.099355   7.524272  

[514 rows x 6 columns]
In [ ]:
pd_results = reformat_results(results[1001:1050])
G = convert_to_network(pd_results)
plot_network(G)
                           rhs                        lhs      supp      conf  \
0                  [encourage]      [nanny, total, south]  0.096774  0.961538   
1                      [nanny]  [encourage, total, south]  0.096774  0.914634   
2                      [total]  [encourage, nanny, south]  0.096774  0.903614   
3           [encourage, nanny]             [total, south]  0.096774  1.000000   
4           [encourage, south]             [nanny, total]  0.096774  1.000000   
..                         ...                        ...       ...       ...   
549              [year, seoul]         [encourage, south]  0.096774  1.000000   
550              [year, south]         [encourage, seoul]  0.096774  0.914634   
551  [encourage, seoul, south]                     [year]  0.096774  1.000000   
552   [encourage, year, south]                    [seoul]  0.096774  1.000000   
553       [year, seoul, south]                [encourage]  0.096774  1.000000   

     supp x conf       lift  
0       0.093052   9.805162  
1       0.088513   9.451220  
2       0.087447   9.337349  
3       0.096774   9.810127  
4       0.096774  10.197368  
..           ...        ...  
549     0.096774  10.333333  
550     0.088513   9.451220  
551     0.096774   4.813665  
552     0.096774   7.828283  
553     0.096774   9.935897  

[554 rows x 6 columns]
In [ ]:
pd_results = reformat_results(results[3831:3850])
G = convert_to_network(pd_results)
plot_network(G)
                               rhs                            lhs      supp  \
0                          [mayor]   [lowest, seoul, said, total]  0.096774   
1                           [said]  [mayor, lowest, total, seoul]  0.096774   
2                          [seoul]   [mayor, lowest, said, total]  0.096774   
3                          [total]   [mayor, lowest, said, seoul]  0.096774   
4                  [lowest, mayor]           [seoul, said, total]  0.096774   
..                             ...                            ...       ...   
459          [mayor, world, total]                [lowest, seoul]  0.098065   
460          [world, seoul, total]                [mayor, lowest]  0.098065   
461  [lowest, world, mayor, seoul]                        [total]  0.098065   
462  [lowest, world, mayor, total]                        [seoul]  0.098065   
463  [lowest, world, seoul, total]                        [mayor]  0.098065   

         conf  supp x conf      lift  
0    0.728155     0.070467  7.524272  
1    0.833333     0.080645  8.497807  
2    0.757576     0.073314  7.828283  
3    0.903614     0.087447  9.337349  
4    0.949367     0.091874  9.810127  
..        ...          ...       ...  
459  0.962025     0.094341  9.810127  
460  1.000000     0.098065  9.810127  
461  1.000000     0.098065  9.337349  
462  0.962025     0.094341  7.531006  
463  1.000000     0.098065  7.524272  

[464 rows x 6 columns]

Interpretation and Conclusion¶

It is worth noting that the network plots above display only a subset of the relations among words in the Twitter response dataset. Because the result set from fitting the Apriori model on the data is very large, it would be time-consuming and nearly impossible to render all the relations and connections in the lemmatized tweet data in a single plot. Therefore, I divided the result set into several subsets and randomly selected some of them to plot.

From the plots we can also see that, although there are numerous words and connections in the lemmatized tweet dataset, the keywords (nodes) and connections (links) are highly repetitive. In other words, the insights that can be derived from this network analysis are limited, because many tweets in the dataset are retweets, which makes the content highly repetitive.

However, valuable observations can still be extracted from the information we have.

Based on the network plots above, we can see strong connections between the keywords "baby", "woman", "boost", "policy", and "seoul". From these words, we can hypothesize that the city of Seoul (South Korea) may be trying to boost its birth rate by improving policies on women's rights. Moreover, words like "woman", "plan", "policy", and "encourage" appear to be consistently connected, which suggests that many East Asian countries/regions may be planning to modify or introduce policies that encourage women to give birth in order to boost the fertility rate of the area.